Abstract

Decentralised optimisation tasks are important components of multi-agent
systems. These tasks can be interpreted as n-player potential games:
therefore game-theoretic learning algorithms can be used to solve
decentralised optimisation tasks. Fictitious play is the canonical example of
these algorithms. Nevertheless fictitious play implicitly assumes that
players have stationary strategies. We present a novel variant of fictitious play
where players predict their opponents' strategies using Extended Kalman
filters and use their predictions to update their strategies.

We show that in 2 by 2 games with at least one pure Nash equilibrium
and in potential games where players have two available actions, the
proposed algorithm converges to the pure Nash equilibrium. The performance
of the proposed algorithm was empirically tested, in two strategic form
games and an ad-hoc sensor network surveillance problem. The proposed
algorithm performs better than the classic fictitious play algorithm in
these games and therefore improves the performance of game-theoretical
learning in decentralised optimisation.

Keywords: Multi-agent learning, game theory, fictitious play,
decentralised optimisation, learning in games, Extended Kalman filter.



1 Introduction 



Recent advances in technology render decentralised optimisation a crucial
component of many applications of multi-agent systems and decentralised control.
Sensor networks (Kho et al. 2009), traffic control (van Leeuwen et al. 2002)
and scheduling problems (Stranjak et al. 2008) are some of the tasks where
decentralised optimisation can be used. These tasks share common
characteristics such as large scale, high computational complexity and communication
constraints that make a centralised solution intractable. It is well known that
many decentralised optimisation tasks can be cast as potential games (Wolpert
and Turner 1999; Arslan et al. 2006), and the search for an optimal solution
can be seen as the task of finding Nash equilibria in a game. Thus it is
feasible to use iterative learning algorithms from the game-theoretic literature
to solve decentralised optimisation problems.

A game-theoretic learning algorithm with proof of convergence in certain
kinds of games is fictitious play (Fudenberg and Levine 1998; Monderer and
Shapley 1996). It is a learning process where players choose an action that
maximises their expected rewards according to the beliefs they maintain about
their opponents' strategies. The players update their beliefs about their
opponents' strategies after observing their actions. Even though fictitious play
converges to Nash equilibrium, this convergence can be very slow. This is because
it implicitly assumes that other players use a fixed strategy in the whole game.
Smyrnakis and Leslie (2010) addressed this problem by representing the
fictitious play process as a state space model and by using particle filters to predict
opponents' strategies. The drawback of this approach is the computational cost
of the particle filters, which makes the method difficult to apply in real-time
applications.

The alternative that we propose in this article is to use extended Kalman
filters (EKF), instead of particle filters, to predict opponents' strategies.
The proposed algorithm therefore has a smaller computational cost than the
particle filter variant of the fictitious play algorithm that was proposed by
Smyrnakis and Leslie (2010). We show that the EKF fictitious play algorithm
converges to a pure Nash equilibrium in 2 by 2 games with at least one pure
Nash equilibrium and in potential games where players have two available
actions. We also empirically observe, in a range of games, that the proposed
algorithm needs fewer iterations than the classic fictitious play to converge
to a solution. Moreover, in our simulations the proposed algorithm converged
to a solution with higher reward than the classic fictitious play algorithm.

The remainder of this paper is organised as follows. We start with a brief
description of game theory, fictitious play and extended Kalman filters. Section
3 introduces the proposed algorithm that combines fictitious play and extended
Kalman filters. The convergence results we obtained are presented in Section
4. In Section 5 we propose some indicative values for the EKF algorithm
parameters. Section 6 presents the simulation results of EKF fictitious play in a
2x2 coordination game, a three-player climbing hill game and an ad-hoc sensor
network surveillance problem. In the final section we present our conclusions.



2 Background 

In this section we introduce some definitions from game theory that we will
use in the rest of this article and the relation between potential games and
decentralised optimisation. We also briefly present the classic fictitious play
algorithm and the extended Kalman filter algorithm.

2.1 Game theory definitions 

We consider a game $\Gamma$ with $I$ players, where each player $i$,
$i = 1, 2, \ldots, I$, chooses his action, $s^i$, from a finite discrete set
$S^i$. We can then define the joint action space of the game as the set product
$S = \times_{i=1}^{I} S^i$. Each Player $i$ receives a reward, $u^i$, after
choosing an action. The reward is a map from the joint action space to the real
numbers, $u^i : S \to \mathbb{R}$. We will often write $s = (s^i, s^{-i})$, where
$s^i$ is the action of Player $i$ and $s^{-i}$ is the joint action of Player
$i$'s opponents. When players select their actions using a probability
distribution they use mixed strategies. The mixed strategy of a player $i$,
$\sigma^i$, is an element of the set $\Delta^i$, where $\Delta^i$ is the set of
all the probability distributions over the action space $S^i$. The joint mixed
strategy, $\sigma$, is then an element of $\Delta = \times_{i=1}^{I} \Delta^i$.
Analogously to the joint actions we will write $\sigma = (\sigma^i, \sigma^{-i})$.
In the special case where the players choose an action with probability one we
will say that players choose their actions using pure strategies. The expected
utility a player $i$ will gain if he chooses a strategy $\sigma^i$ (resp.
$s^i$), when his opponents choose the joint strategy $\sigma^{-i}$, is
$u^i(\sigma^i, \sigma^{-i})$ (resp. $u^i(s^i, \sigma^{-i})$).

A common decision rule in game theory is best response (BR). The best
response is defined as the action that maximises a player's expected utility
given his opponents' strategies. Thus for a specific opponents' strategy
$\sigma^{-i}$ we evaluate the best response as:

$$BR^i(\sigma^{-i}) = \arg\max_{s^i \in S^i} u^i(s^i, \sigma^{-i}) \qquad (1)$$



Nash (1950) showed that every game has at least one equilibrium, which is
a fixed point of the best response correspondence, $\sigma^i \in BR^i(\sigma^{-i})$.
Thus when a joint mixed strategy $\sigma$ is a Nash equilibrium then:

$$u^i(\sigma^i, \sigma^{-i}) \ge u^i(s^i, \sigma^{-i}) \quad \text{for all } s^i \in S^i \qquad (2)$$

Equation (2) implies that if a strategy $\sigma$ is a Nash equilibrium then it is
not possible for a player to increase his utility by unilaterally changing his
strategy. When all the players in a game select their actions using pure
strategies then the equilibrium actions are referred to as pure strategy Nash
equilibria. A pure equilibrium is strict if each player has a unique best
response to his opponents' actions.
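The best response rule (1) and the equilibrium condition (2) can be illustrated with a minimal Python sketch for a two-player game. The function names and the 2x2 coordination payoffs are illustrative assumptions, not part of the original text.

```python
from itertools import product

def expected_utility(payoff, action, opp_strategy):
    """Expected reward of `action` against a mixed opponent strategy.

    payoff[a][b] is the player's reward when he plays a and the opponent plays b.
    """
    return sum(prob * payoff[action][b] for b, prob in enumerate(opp_strategy))

def best_response(payoff, opp_strategy):
    """All actions that maximise the expected utility, as in equation (1)."""
    values = [expected_utility(payoff, a, opp_strategy) for a in range(len(payoff))]
    best = max(values)
    return [a for a, value in enumerate(values) if value == best]

def pure_nash_equilibria(payoff_row, payoff_col):
    """Joint actions (a, b) from which no player gains by deviating alone.

    payoff_row[a][b] and payoff_col[a][b] are the rewards of the row and the
    column player for the joint action (a, b).
    """
    n_rows, n_cols = len(payoff_row), len(payoff_row[0])
    equilibria = []
    for a, b in product(range(n_rows), range(n_cols)):
        row_ok = all(payoff_row[a][b] >= payoff_row[alt][b] for alt in range(n_rows))
        col_ok = all(payoff_col[a][b] >= payoff_col[a][alt] for alt in range(n_cols))
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria

# Hypothetical 2x2 coordination game: both players receive 1 when they match.
coordination = [[1, 0], [0, 1]]
print(best_response(coordination, [0.8, 0.2]))
print(pure_nash_equilibria(coordination, coordination))
```

Returning a list from `best_response` keeps ties visible, which matters because a pure equilibrium is strict only when that list has a single element for every player.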



2.2 Decentralised optimisation tasks as potential games 

A class of games that is of particular interest in multi-agent systems and
decentralised optimisation tasks is potential games, because of their utility
structure. In particular, in order to be able to solve an optimisation task in a
decentralised way, the local functions should have similar characteristics to
the global function that we want to optimise. This suggests that an action which
improves or reduces the utility of an individual should respectively increase
or reduce the global utility. Potential games have this property, since the
potential function (global function) depicts the changes in the players' payoffs
(local functions) when they unilaterally change their actions. More formally we
can write

$$u^i(s^i, s^{-i}) - u^i(\tilde{s}^i, s^{-i}) = \phi(s^i, s^{-i}) - \phi(\tilde{s}^i, s^{-i})$$

where $\phi$ is a potential function and the above equality holds for every
player $i$, for every action $s^{-i} \in S^{-i}$, and for every pair of actions
$s^i, \tilde{s}^i \in S^i$, where $S^i$ and $S^{-i}$ represent the set of all
available actions for Player $i$ and his opponents respectively. Moreover
potential games have at least one pure Nash equilibrium, hence there is at least
one joint action $s$ where no player can increase their reward, and therefore
the potential function, through a unilateral deviation.
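The defining equality of a potential game can be checked by brute force on small games. The sketch below is illustrative only: `is_exact_potential` and the two example games are assumptions introduced here, with each utility written as a function of the joint action.

```python
from itertools import product

def is_exact_potential(utilities, potential, action_sets, tol=1e-12):
    """Check that every unilateral deviation changes u^i and phi by the same amount."""
    for joint in product(*action_sets):
        for i, actions in enumerate(action_sets):
            for alt in actions:
                deviated = joint[:i] + (alt,) + joint[i + 1:]
                du = utilities[i](joint) - utilities[i](deviated)
                dphi = potential(joint) - potential(deviated)
                if abs(du - dphi) > tol:
                    return False
    return True

def common_reward(joint):
    """Hypothetical coordination reward shared by both players."""
    return 1.0 if joint[0] == joint[1] else 0.0

def pennies_row(joint):
    """Matching pennies: the row player wins when the actions match."""
    return 1.0 if joint[0] == joint[1] else -1.0

def pennies_col(joint):
    return -pennies_row(joint)

actions = [(0, 1), (0, 1)]
print(is_exact_potential([common_reward, common_reward], common_reward, actions))
print(is_exact_potential([pennies_row, pennies_col], pennies_row, actions))
```

The first call succeeds because a common reward is its own potential; the second fails for this candidate potential because the column player's gains have the opposite sign.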
It is feasible to choose an appropriate form of the agents' utility function
in order for the global utility to act as a potential of the system. Wonderful
life utility is a utility function that was introduced by Wolpert and Turner
(1999) and applied by Arslan et al. (2006) to formulate distributed optimisation
tasks as potential games. Player $i$'s utility, when wonderful life utility is
used, can be defined as the difference between the global utility $u_g$ and the
utility of the system when a reference action is used as player $i$'s action.
More formally, when player $i$ chooses an action $s^i$ we write

$$u^i(s^i) = u_g(s^i, s^{-i}) - u_g(s_0^i, s^{-i})$$

where $s_0^i$ denotes the reference action of player $i$. Hence the
decentralised optimisation problem can be cast as a potential game and any
algorithm that is proved to converge to a Nash equilibrium of a potential game,
which is a local or the global optimum of the optimisation problem, will
converge to a joint action from which no player can increase the global reward
through unilateral deviation.
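As a rough illustration of wonderful life utility, the sketch below computes a player's utility as the global utility minus the global utility obtained when that player uses a reference action. The `coverage` global utility and the choice of 0 as an "idle" reference action are hypothetical and serve only as an example.

```python
def wonderful_life_utility(global_utility, joint_action, player, reference_action):
    """Global utility minus the global utility with `player` at the reference action."""
    baseline = list(joint_action)
    baseline[player] = reference_action
    return global_utility(tuple(joint_action)) - global_utility(tuple(baseline))

def coverage(joint_action):
    """Hypothetical global utility: number of distinct targets observed (0 = idle)."""
    return len({a for a in joint_action if a != 0})

joint = (1, 1, 2)   # players 0 and 1 observe target 1, player 2 observes target 2
print(wonderful_life_utility(coverage, joint, player=0, reference_action=0))
print(wonderful_life_utility(coverage, joint, player=2, reference_action=0))
```

Because the baseline term does not depend on the player's own action, a unilateral change of action alters the wonderful life utility and the global utility by the same amount, which is why the global utility acts as a potential.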

2.3 Fictitious play 



Fictitious play (Brown 1951) is a widely used learning technique in game
theory. In fictitious play each player chooses his action according to the best
response to his beliefs about his opponents' joint mixed strategy $\sigma^{-i}$.

Initially each player has some prior beliefs about the strategy that each of
his opponents uses to choose an action, based on a weight function $\kappa_t$.
The players, after each iteration, update the weight function and therefore
their beliefs about their opponents' strategies, and play again the best
response according to their beliefs. More formally, in the beginning of a game
Player $i$ maintains some arbitrary non-negative initial weight functions
$\kappa_0^j$, $\forall j \in [1, I] \setminus \{i\}$, that are updated using the
formula:

$$\kappa_t^j(s^j) = \kappa_{t-1}^j(s^j) + I_{s_t^j = s^j}$$

for each $j$, where $I_{s_t^j = s^j} = 1$ if $s_t^j = s^j$ and $0$ otherwise.

The mixed strategy of opponent $j$ is estimated from the following formula:

$$\sigma_t^j(s^j) = \frac{\kappa_t^j(s^j)}{\sum_{s' \in S^j} \kappa_t^j(s')} \qquad (3)$$

Player $i$, based on his beliefs about his opponents' strategies, chooses the
action which maximises his expected payoff. When player $i$ uses equation (3)
to update the beliefs about his opponents' strategies he treats the environment
of the game as stationary and implicitly assumes that the actions of the players
are sampled from a fixed probability distribution. Therefore the recent
observations have the same weight as the initial ones. This approach leads to
poor adaptation when the other players choose to change their strategies.
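The slow adaptation of classic fictitious play can be seen in a few lines of Python. This is a minimal sketch of the counting update and of the normalisation in equation (3); the prior weights and the opponent's action sequence are assumptions made for the example.

```python
def update_weights(weights, observed_action):
    """Add one to the weight of the observed action, leave the others unchanged."""
    return [w + (1.0 if a == observed_action else 0.0) for a, w in enumerate(weights)]

def empirical_strategy(weights):
    """Equation (3): normalise the weights into a mixed-strategy estimate."""
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical opponent: plays action 0 for 50 rounds, then switches to action 1.
weights = [1.0, 1.0]                      # assumed uniform prior weights
for observed in [0] * 50 + [1] * 10:
    weights = update_weights(weights, observed)

belief = empirical_strategy(weights)
print(belief)   # action 0 still dominates the belief ten rounds after the switch
```

Ten rounds after the switch the belief in action 1 is only 11/62, since every past observation carries the same weight as the most recent one.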

2.4 Fictitious play as a state space model 



We follow Smyrnakis and Leslie (2010) and represent the fictitious play
process as a state-space model. According to this state space model each player
has a propensity $Q_t(s^i)$ to play each of his available actions $s^i \in S^i$,
and then he forms his strategy based on these propensities. Finally he chooses
his actions based on his strategy and the best response decision rule. Because
players have no information about the evolution of their opponents'
propensities, and under the assumption that the changes in propensities are
small from one iteration of the game to another, we model propensities using a
Gaussian autoregressive prior on all propensities. We set $Q_0 \sim N(0, I)$ and
recursively update the value of $Q_t$ according to the value of $Q_{t-1}$ as
follows:

$$Q(s_t) = Q(s_{t-1}) + \eta_t$$

where $\eta_t \sim N(0, \chi^2 I)$. The action of a player is then related to
his propensity by the following sigmoid equation for every $s^i \in S^i$:

$$\sigma_t(s^i) = \frac{e^{Q_t(s^i)/\tau}}{\sum_{\tilde{s} \in S^i} e^{Q_t(\tilde{s})/\tau}}$$

Therefore players will assume that at every iteration $t$ their opponents have a
different strategy $\sigma_t$.
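The mapping from propensities to a mixed strategy, together with the Gaussian random walk on the propensities, can be sketched as follows. The function names, the noise scale `chi` and the temperature values are assumptions chosen for illustration.

```python
import math
import random

def strategy_from_propensities(propensities, tau=1.0):
    """Sigmoid (softmax) link between propensities and action probabilities."""
    shift = max(propensities)                 # subtract the max for numerical stability
    exps = [math.exp((q - shift) / tau) for q in propensities]
    total = sum(exps)
    return [e / total for e in exps]

def random_walk_step(propensities, chi, rng):
    """One step of the autoregressive prior: Q_t = Q_{t-1} + eta_t, eta_t ~ N(0, chi^2 I)."""
    return [q + rng.gauss(0.0, chi) for q in propensities]

rng = random.Random(0)
q = [0.0, 0.0]
print(strategy_from_propensities(q))          # equal propensities give a uniform strategy
for _ in range(5):
    q = random_walk_step(q, chi=0.1, rng=rng)
print(strategy_from_propensities(q, tau=0.5))
```

A smaller temperature `tau` makes the strategy more concentrated on the action with the largest propensity.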

2.5 Kalman filters and Extended Kalman filters 

Our objective is to estimate player $i$'s opponent's propensity and thus to
estimate the marginal probability $p(Q_t, s_{1:t})$. This objective can be
represented as a Hidden Markov Model (HMM). HMMs are used to predict the value
of an unobserved variable $x_t$, the hidden state, using the observations of
another variable $z_{1:t}$. There are two main assumptions in the HMM
representation. The first is that the probability of being at any state $x_t$
at time $t$ depends only on the state at time $t-1$, $x_{t-1}$. The second is
that an observation at time $t$ depends only on the current state $x_t$. One of
the most common methods to estimate $p(x_{1:t}, z_{1:t})$ is the Kalman filter
and its variations. The Kalman filter (Kalman et al. 1960) is based on two
assumptions: the first is that the state variable is Gaussian, the second is
that the observations are the result of a linear combination of the state
variable. Hence Kalman filters can be used in cases which are represented as
the following state space model:

$$x_t = A x_{t-1} + \xi_{t-1} \quad \text{(hidden layer)}$$
$$y_t = B x_t + \zeta_t \quad \text{(observations)}$$

where $\xi_t$ and $\zeta_t$ follow a zero mean normal distribution with
covariance matrices $\Xi = q_t I$ and $Z = r_t I$ respectively, and $A$, $B$ are
linear transformation matrices. When the distribution of the state variable
$x_t$ is Gaussian then $p(x_t \mid y_{1:t})$ is also a Gaussian distribution,
since $y_t$ is a linear combination of $x_t$. Therefore it is enough to estimate
its mean and variance to fully characterise $p(x_t \mid y_{1:t})$.

Nevertheless in the state space model we want to implement, the relation
between Player $i$'s opponent's propensity and his actions is not linear. Thus
we should use a more general form of state space model such as:

$$x_t = f(x_{t-1}) + \xi_{t-1}$$
$$y_t = h(x_t) + \zeta_t \qquad (4)$$

where $\xi_t$ and $\zeta_t$ are the hidden and observation state noise
respectively, with zero mean and covariance matrices $\Xi = q_t I$ and
$Z = r_t I$ respectively. The distribution $p(x_t \mid y_{1:t})$ is not
Gaussian because $f(\cdot)$ and $h(\cdot)$ are non-linear functions. A simple
method to overcome this shortcoming is to use a first order Taylor expansion to
approximate the distributions of the state space model in (4). In particular we
let $x_t = m_t + \epsilon$, where $m_t$ denotes the mean of $x_t$ and
$\epsilon \sim N(0, P)$. We can rewrite (4) as:

$$x_t = f(m_{t-1} + \epsilon) + \xi_{t-1} = f(m_{t-1}) + F_x(m_{t-1})\epsilon + \xi_{t-1}$$
$$y_t = h(m_t + \epsilon) + \zeta_t = h(m_t) + H_x(m_t)\epsilon + \zeta_t \qquad (5)$$

where $F_x(m_{t-1})$ and $H_x(m_t)$ are the Jacobian matrices of $f$ and $h$
evaluated at $m_{t-1}$ and $m_t$, respectively. If we use the transformations
in (5) then $p(x_t \mid y_{1:t})$ is a Gaussian distribution.

Since $p(x_t \mid y_{1:t})$ is a Gaussian distribution, to fully characterise it
we need to evaluate its mean and its variance. The EKF process (Jazwinski 1970;
Grewal and Andrews 2011) estimates this mean and variance in two steps, the
prediction and the update step. In the prediction step at any iteration $t$ the
distribution of the state variable is estimated based on all the observations
until time $t-1$, $p(x_t \mid y_{1:t-1})$. The distribution
$p(x_t \mid y_{1:t-1})$ is Gaussian and we will denote its mean and variance as
$m_t^-$ and $P_t^-$ respectively. During the update step the estimation of the
prediction step is corrected in the light of the new observation at time $t$,
so we estimate $p(x_t \mid y_{1:t})$. This is also a Gaussian distribution and
we will denote its mean and variance as $m_t$ and $P_t$ respectively.

The prediction and the update steps of the EKF process (Jazwinski 1970;
Grewal and Andrews 2011) to estimate the mean and the variance of
$p(x_t \mid y_{1:t-1})$ and $p(x_t \mid y_{1:t})$ respectively are the following:

Prediction Step

$$m_t^- = f(m_{t-1})$$
$$P_t^- = F(m_{t-1}) P_{t-1} F^T(m_{t-1}) + \Xi_{t-1}$$

where the $(j, j')$ element of $F(m_t)$ is defined as

$$[F(m_t)]_{j,j'} = \left.\frac{\partial f(x_j, q)}{\partial x_{j'}}\right|_{x = m_t,\, q = 0}$$

Update Step

$$v_t = z_t - h(m_t^-)$$
$$S_t = H(m_t^-) P_t^- H^T(m_t^-) + Z$$
$$K_t = P_t^- H^T(m_t^-) S_t^{-1}$$
$$m_t = m_t^- + K_t v_t$$
$$P_t = P_t^- - K_t S_t K_t^T$$

where $z_t$ is the observation vector (with 1 in the entry of the observed
action and 0 everywhere else) and the $(j, j')$ element of $H(m_t^-)$ is defined
as:

$$[H(m_t^-)]_{j,j'} = \left.\frac{\partial h(x_j, r)}{\partial x_{j'}}\right|_{x = m_t^-,\, r = 0}$$
3 Fictitious play and EKF



For the rest of this paper we will only consider inference over a single
opponent's mixed strategy in fictitious play. Separate estimates will be formed
identically and independently for each opponent. We therefore consider only one
opponent, and we drop all dependence on player $i$, and write $s_t$, $\sigma_t$
and $Q_t$ for Player $i$'s opponent's action, strategy and propensity
respectively. Moreover for any vector $x$, $x[j]$ will denote the $j$th element
of the vector and for any matrix $y$, $y[i, j]$ will denote the $(i, j)$th
element of the matrix.

We can use the following state space model to describe the fictitious play
process:

$$Q_t = Q_{t-1} + \xi_{t-1}$$
$$s_t = h(Q_t) + \zeta_t$$

where $\xi_{t-1} \sim N(0, \Xi)$ is the noise of the state process and
$\zeta_t$ is the error of the observation state, with zero mean and covariance
matrix $Z$, which occurs because we approximate a discrete process like best
responses, equation (1), using a continuous function $h(\cdot)$. Hence we can
combine the EKF with fictitious play as follows. At time $t-1$ Player $i$ has
an estimation of his opponent's propensity using a Gaussian distribution with
mean $m_{t-1}$ and variance $P_{t-1}$, and has observed an action $s_{t-1}$.
Then at time $t$ he uses the EKF prediction step to estimate his opponent's
propensity. The mean and variance of the approximation $p(Q_t \mid s_{1:t-1})$
of the opponent's propensity are:

$$m_t^- = m_{t-1}$$
$$P_t^- = P_{t-1} + \Xi$$

Player $i$ then evaluates his opponent's strategies using his estimations as:

$$\sigma_t(s_t) = \frac{\exp(m_t^-[s_t]/\tau)}{\sum_{\tilde{s} \in S} \exp(m_t^-[\tilde{s}]/\tau)} \qquad (6)$$

where $m_t^-[s_t]$ is the mean of Player $i$'s estimation about the propensity
of his opponent to play action $s_t$. Player $i$ then uses the estimation of his
opponent's strategy, equation (6), and best responses, equation (1), to choose
an action. After observing the opponent's action $s_t$, Player $i$ corrects his
estimations about his opponent's propensity using the update equations of the
EKF process. The update equations are:



$$v_t = z_t - h(m_t^-)$$
$$S_t = H(m_t^-) P_t^- H^T(m_t^-) + Z$$
$$K_t = P_t^- H^T(m_t^-) S_t^{-1}$$
$$m_t = m_t^- + K_t v_t$$
$$P_t = P_t^- - K_t S_t K_t^T$$

where $h(m_t^-)[s] = \frac{\exp(m_t^-[s]/\tau)}{\sum_{\tilde{s} \in S} \exp(m_t^-[\tilde{s}]/\tau)}$,
and $\tau$ is a temperature parameter. The Jacobian matrix $H(m_t^-)$ is defined
as

$$[H(m_t^-)]_{j,j'} = \begin{cases} \dfrac{\sum_{k \neq j} \exp(m_t^-[j]) \exp(m_t^-[k])}{\left(\sum_{k} \exp(m_t^-[k])\right)^2} & \text{if } j = j' \\[2ex] -\dfrac{\exp(m_t^-[j]) \exp(m_t^-[j'])}{\left(\sum_{k} \exp(m_t^-[k])\right)^2} & \text{if } j \neq j' \end{cases}$$

Table 1 summarises the fictitious play algorithm when EKF is used to predict
opponents' strategies.

At time $t$

1. Player $i$ maintains some estimations about his opponent's propensity up
to time $t-1$, $p(Q_{t-1} \mid s_{1:t-1})$. Thus he has an estimation of the
mean $m_{t-1}$ and the covariance $P_{t-1}$ of this distribution.

2. Then Player $i$ updates his estimations about his opponent's propensities
$p(Q_t \mid s_{1:t-1})$ using the equations $m_t^- = m_{t-1}$,
$P_t^- = P_{t-1} + \Xi$.

3. Based on the estimates of step 2 each player updates his beliefs about his
opponent's strategies using
$\sigma_t(s_t) = \frac{\exp(m_t^-[s_t]/\tau)}{\sum_{\tilde{s}} \exp(m_t^-[\tilde{s}]/\tau)}$.

4. Choose an action based on the beliefs of step 3 according to best response.

5. Observe opponent's action $s_t$.

6. Update the propensity estimates using $m_t = m_t^- + K_t v_t$ and
$P_t = P_t^- - K_t S_t K_t^T$.

7. Set $t = t + 1$.



Table 1: EKF Fictitious Play algorithm
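The steps of Table 1 can be sketched in Python with NumPy for a single opponent. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `EKFBelief`, the prior $m_0 = 0$, $P_0 = I$, the default values of `q`, `r` and `tau`, and the coordination payoffs are all chosen here for the example. The Jacobian is that of the softmax with temperature, so it carries a factor $1/\tau$ and matches the expression in the text when $\tau = 1$.

```python
import numpy as np

def softmax(m, tau):
    """Map propensities to a mixed strategy (the function h in the text)."""
    z = np.exp((m - np.max(m)) / tau)
    return z / z.sum()

def softmax_jacobian(m, tau):
    """Jacobian of softmax(m / tau) with respect to m."""
    p = softmax(m, tau)
    return (np.diag(p) - np.outer(p, p)) / tau

class EKFBelief:
    """Gaussian belief (m, P) about one opponent's propensities."""

    def __init__(self, n_actions, q=0.01, r=0.1, tau=1.0):
        self.m = np.zeros(n_actions)      # assumed prior mean
        self.P = np.eye(n_actions)        # assumed prior covariance
        self.Xi = q * np.eye(n_actions)   # state noise covariance
        self.Z = r * np.eye(n_actions)    # observation noise covariance
        self.tau = tau

    def predict(self):
        """Steps 2 and 3: random-walk prediction and strategy estimate."""
        self.m_pred = self.m.copy()
        self.P_pred = self.P + self.Xi
        return softmax(self.m_pred, self.tau)

    def update(self, observed_action):
        """Step 6: correct the prediction with the observed action."""
        z = np.zeros(len(self.m_pred))
        z[observed_action] = 1.0
        H = softmax_jacobian(self.m_pred, self.tau)
        v = z - softmax(self.m_pred, self.tau)
        S = H @ self.P_pred @ H.T + self.Z
        K = self.P_pred @ H.T @ np.linalg.inv(S)
        self.m = self.m_pred + K @ v
        self.P = self.P_pred - K @ S @ K.T

def best_response(payoff, strategy):
    """Step 4: payoff[a, b] is the reward for playing a against b."""
    return int(np.argmax(payoff @ strategy))

belief = EKFBelief(n_actions=2)
payoff = np.eye(2)              # hypothetical coordination payoffs
for _ in range(20):             # the opponent keeps playing action 0
    strategy = belief.predict()
    action = best_response(payoff, strategy)
    belief.update(observed_action=0)
```

In this sketch `update` relies on the prediction stored by the preceding `predict` call, mirroring the order of steps 2 to 6 in the table.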



4 Theoretical Results 

In this section we present the convergence results we obtained for games with at
least one pure Nash equilibrium and players who have 2 available actions,
$s = (1, 2)$. We will denote as $-s$ the action that a player does not choose,
for example if Player $i$'s opponent chooses action 1, $s = 1$ and hence
$-s = 2$. Also we will denote as $m[1]$ and $m[2]$ the estimated means of the
opponent's propensity of action 1 and 2 respectively. Similarly $P[1, 1]$ and
$P[2, 2]$ will represent the variance of the propensity's estimation of action
1 and 2 respectively, and $P[1, 2]$, $P[2, 1]$ their covariance.

The proposed algorithm has the following two properties:

Proposition 1. If at iteration $t$ of the EKF fictitious play algorithm, action
$s$ is played by Player $i$'s opponent, then the estimation of his opponent's
propensity to play action $s$ increases, $m_{t-1}[s] < m_t[s]$. Also the
estimation of his opponent's propensity to play action $-s$ decreases,
$m_{t-1}[-s] > m_t[-s]$.

Proof. The proof of Proposition 1 is in Appendix A. □
         L       R
  U     1,1     0,0
  D     0,0     1,1

Table 2: Simple coordination game



Proposition 1 implies that players, when they use EKF fictitious play, learn
their opponent's strategy and eventually they will choose the action that will
maximise their reward based on their estimation. Nevertheless there are cases
where players may change their actions simultaneously and become trapped in a
cycle instead of converging to a pure Nash equilibrium. As an example we
consider the game that is depicted in Table 2.

This is a simple coordination game with two pure Nash equilibria, the joint
actions (U, L) and (D, R). In the case where the two players start from joint
action (U, R) or (D, L) and they always change their action simultaneously then
they will never reach one of the two pure Nash equilibria of the game.
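The cycle described above can be reproduced with a small Python sketch in which both players simultaneously best-reply to the opponent's previous action. This myopic rule is a simplification used here only to illustrate simultaneous switching; it is not the fictitious play update itself, and the function names are hypothetical.

```python
# Payoffs of the simple coordination game of Table 2 (common to both players).
PAYOFF = {("U", "L"): 1, ("U", "R"): 0, ("D", "L"): 0, ("D", "R"): 1}

def row_best_reply(col_action):
    """Row player's best reply to the column player's last action."""
    return max(("U", "D"), key=lambda a: PAYOFF[(a, col_action)])

def col_best_reply(row_action):
    """Column player's best reply to the row player's last action."""
    return max(("L", "R"), key=lambda a: PAYOFF[(row_action, a)])

def simultaneous_best_replies(joint_action, steps):
    """Both players switch at the same time, each replying to the other's last action."""
    path = [joint_action]
    for _ in range(steps):
        row, col = path[-1]
        path.append((row_best_reply(col), col_best_reply(row)))
    return path

print(simultaneous_best_replies(("U", "R"), 4))   # alternates between (U, R) and (D, L)
print(simultaneous_best_replies(("U", "L"), 4))   # stays at the equilibrium (U, L)
```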

Proposition 2. In a 2 x 2 game where the players use the EKF fictitious play
process to choose their actions, and the variance of the observation state is
set to $Z = rI + \epsilon I$, with high probability the two players will not
change their action simultaneously infinitely often. We define $\epsilon$ as a
random number from a normal distribution with zero mean and arbitrarily small
covariance matrix, and $I$ is the identity matrix.

Proof. The proof of Proposition 2 is in Appendix B. □

We should mention here that the reason we set $Z = rI + \epsilon I$ is to
break any symmetries that occur because of the initialisation of the EKF
fictitious play algorithm. Based on Propositions 1 and 2 we can infer the
following propositions and theorems.

Proposition 3. (a) In a game where players have two available actions if $s$ is
a Nash equilibrium, and $s$ is played at date $t$ in the process of EKF
fictitious play, $s$ is played at all subsequent dates. That is, strict Nash
equilibria are absorbing for the process of EKF fictitious play. (b) Any pure
strategy steady state of EKF fictitious play must be a Nash equilibrium.

Proof. Consider the case where players' beliefs $\sigma_t$ are such that their
optimal choices correspond to a strict Nash equilibrium $s$. In the EKF
fictitious play process players' beliefs are formed identically and
independently for each opponent based on equation (6). By Proposition 1 we know
that players' estimations about their opponents' propensities, and therefore
their strategies, that each player maintains for the other players, will
increase for the actions that are included in $s$ and will be reduced otherwise.
Thus the best response to their beliefs $\sigma_{t+1}$ will be again $s$ and
since $s$ is a Nash equilibrium they will not deviate from it. Conversely, if a
player remains at a pure strategy profile, then eventually the assessments will
become concentrated at that profile, because of Proposition 1, hence if the
profile is not a Nash equilibrium, one of the players would eventually want to
deviate. □
Proposition 4. Under EKF fictitious play, if the beliefs over each player's
choices converge, the strategy profile corresponding to the product of these
distributions is a Nash equilibrium.

Proof. Suppose that the beliefs of the players at time $t$, $\sigma_t$, converge
to some profile $\sigma$. If $\sigma$ were not a Nash equilibrium, some player
would eventually want to deviate and the beliefs would also deviate since,
based on Proposition 1, players eventually learn their opponents' actions. □

Theorem 1. The EKF fictitious play process converges to the Nash equilibrium
in 2 x 2 games with at least one pure Nash equilibrium, when the covariance
matrix of the observation space error, $Z$, is defined as in Proposition 2,
$Z = rI + \epsilon I$.

Proof. We can distinguish two possible initial states in the game. In the first,
the players' initial beliefs are such that their initial joint action $s_0$ is
a Nash equilibrium. From Proposition 3 and equation (6) we know that they will
play the joint action which is a Nash equilibrium for all the iterations of the
game.

The second case, where the initial beliefs of the players are such that their
initial joint action $s_0$ is not a Nash equilibrium, is divided into two
subcategories. The first includes 2 x 2 games with only one pure Nash
equilibrium. In this case, one of the two players has a dominant action. This
action maximises his expected payoff regardless of the other player's strategy
and thus he will select this action in every iteration of the game. Therefore,
because of Proposition 1, the other player will learn his opponent's strategy
and the players will choose the joint action which is the pure Nash equilibrium.

The second category includes 2 x 2 games with two pure Nash equilibria, like
the simple coordination game that is depicted in Table 2. In this case the
players' initial joint action $s_0 = (s^1, s^2)$ is not a Nash equilibrium. Then
the players will learn their opponent's strategy, by Proposition 1 and equation
(6), and they will change their action. We know from Proposition 2 that in a
finite time with high probability the players will not change their actions
simultaneously, and hence they will end up in a joint action that will be one
of the two pure Nash equilibria of the game. □

We can extend the results of Theorem 1 to n x 2 games with a better reply
path. A game with a better reply path can be represented as a graph where the
vertices are the joint actions $s$ of the game and there is an edge that
connects $s$ with $s'$ iff only one player $i$ can increase his payoff by
changing his action (Young 2005). Potential games have a better reply path.



Theorem 2. The EKF fictitious play process converges to the Nash equilibrium
in n x 2 games with a better reply path when the covariance matrix of the
observation space error, $Z$, is $Z = rI + \epsilon I$.

Proof. Similarly to the 2 x 2 games, if the initial beliefs of the players are
such that their initial joint action $s_0$ is a Nash equilibrium, from
Proposition 3 and equation (6) we know that they will play the joint action
which is a Nash equilibrium for the rest of the game.

Moreover, in the case where the initial beliefs of the players are such that
their initial joint action $s_0$ is not a Nash equilibrium, based on Proposition
1 and Proposition 2, after a finite number of iterations, because the game has
a better reply path, the only player that can improve his payoff by changing
his action will choose a new action which will result in a new joint action
$s$. If this joint action is not a Nash equilibrium then again after a finite
number of iterations the player who can improve his payoff will change action
and a new joint action $s'$ will be played. Thus after the search of the
vertices of a finite graph, and thus after a finite number of iterations,
players will choose a joint action which is a Nash equilibrium. □
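The notion of a better reply path used in the proof can be illustrated with a short Python sketch that follows unilateral improving deviations in a game where every player has two actions. It shows only the path on the graph of joint actions, not the EKF fictitious play dynamics; the function names and the common-reward example are assumptions.

```python
def better_reply_path(utilities, start, max_steps=1000):
    """Follow unilateral improving deviations in a game with two actions per player.

    utilities[i] maps a joint action (tuple of 0/1) to player i's reward.
    """
    joint = tuple(start)
    path = [joint]
    for _ in range(max_steps):
        for i in range(len(joint)):
            deviated = joint[:i] + (1 - joint[i],) + joint[i + 1:]
            if utilities[i](deviated) > utilities[i](joint):
                joint = deviated
                path.append(joint)
                break
        else:
            return path   # no player can improve: a pure Nash equilibrium
    raise RuntimeError("no equilibrium reached within max_steps")

def agreeing_pairs(joint):
    """Hypothetical common reward: number of player pairs choosing the same action."""
    n = len(joint)
    return sum(1 for i in range(n) for j in range(i + 1, n) if joint[i] == joint[j])

utilities = [agreeing_pairs] * 3          # three players sharing the same reward
path = better_reply_path(utilities, (0, 1, 0))
print(path)
```

Because the shared reward acts as a potential and strictly increases along the path, the search visits each joint action at most once and must stop at an equilibrium.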



5 Simulations to define algorithm parameters $\Xi$ and $Z$

The covariance matrix of the state space error $\Xi = qI$ and of the
measurement error $Z = rI$ are two parameters that we should define in the
beginning of the EKF fictitious play algorithm and they affect its performance.
Our aim is to find values, or ranges of values, of $q$ and $r$ that can
efficiently track an opponent's strategy when it changes smoothly or abruptly,
instead of choosing $q$ and $r$ heuristically for each opponent when we use the
EKF algorithm. Nevertheless it is possible that for some games the results of
the EKF algorithm will be improved for other combinations of $q$ and $r$ than
the ones that we propose in this section.

We examine the impact of the EKF fictitious play algorithm parameters on its
performance in the following two tracking scenarios. In the first one a single
opponent chooses his actions using a mixed strategy which changes smoothly and
has a sinusoidal form over the iterations of the tracking scenario. In
particular for $t = 1, 2, \ldots, 100$ iterations of the game:
$\sigma_t(1) = \frac{\cos(2\pi t / n) + 1}{2} = 1 - \sigma_t(2)$, where
$n = 100$. In the second toy example Player $i$'s opponent changes his
strategy abruptly and chooses action 1 with probability $\sigma_t^2(1) = 1$
during the first 25 and the last 25 iterations of the game, and for the
remaining iterations of the game $\sigma_t^2(1) = 0$. The probability of the
second action is calculated as: $\sigma_t^2(2) = 1 - \sigma_t^2(1)$.
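The two tracking scenarios can be generated with a few lines of Python. This is a sketch under an explicit assumption: the smooth scenario is taken to be $\sigma_t(1) = (\cos(2\pi t/n) + 1)/2$, which is one sinusoidal form consistent with the description, and the function names are hypothetical.

```python
import math

def smooth_strategy(t, n=100):
    """Assumed sinusoidal scenario: probability of action 1 follows (cos(2*pi*t/n) + 1) / 2."""
    p1 = (math.cos(2 * math.pi * t / n) + 1) / 2
    return [p1, 1 - p1]

def abrupt_strategy(t, n=100):
    """Action 1 with probability 1 in the first 25 and last 25 iterations, 0 otherwise."""
    p1 = 1.0 if t <= 25 or t > n - 25 else 0.0
    return [p1, 1 - p1]

def mean_absolute_error(estimated, true):
    """Average absolute difference between estimated and true probabilities of action 1."""
    return sum(abs(e[0] - s[0]) for e, s in zip(estimated, true)) / len(true)

smooth = [smooth_strategy(t) for t in range(1, 101)]
abrupt = [abrupt_strategy(t) for t in range(1, 101)]
# A constant uniform guess serves as a naive baseline for the error measure.
uniform_guess = [[0.5, 0.5]] * 100
print(mean_absolute_error(uniform_guess, abrupt))
```

Any tracker that outputs a strategy estimate per iteration can be scored against these sequences with `mean_absolute_error`.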

We tested the performance of the proposed algorithm for the following range
of parameters: $10^{-4} < q < 1$ and $10^{-4} < r < 1$. We repeated both
examples 100 times for each of the combinations of $q$ and $r$. Each time we
measured the absolute error of the estimated strategy against the real one. The
combined average absolute error when both examples are considered is depicted
in Figure 1. The darkest areas of the contour plot represent the areas where the
average absolute error is minimised.

The average absolute error is minimised for a range of values of $q$ and $r$
that form two distinct areas. In the first area, the wide dark area of Figure
1, the ranges of $q$ and $r$ were $0.08 < q < 0.4$ and $0.2 < r < 1$
respectively. In the second area, the narrow dark area of Figure 1, the ranges
of $q$ and $r$ were $0.001 < q < 0.025$ and $0.08 < r < 0.13$ respectively. The
minimum error which we observed in our simulations was in the narrow area and
in particular when $\Xi = 0.01I$ and $Z = 0.1I$, where $I$ is the identity
matrix.
Figure 1: Combined absolute error for both tracking scenarios. The range of
both parameters, $q$ and $r$, is between $10^{-4}$ and 1.

Table 3: Climbing hill game with three players. Player 1 selects rows, Player 2
selects columns, and Player 3 selects the matrix. The global reward depicted in
the matrices is received by all players. The unique Nash equilibrium is in bold.



6 Simulation results 

This section is divided in two parts. The first part contains results of our
simulations in two strategic form games and the second part contains the
results we obtained in an ad-hoc sensor network surveillance problem. In all
the simulations of this section we set the covariance matrices of the hidden
and the observation state to $\Xi = 0.01I$ and $Z = (0.1 + \epsilon)I$
respectively, where $\epsilon \sim N(0, 10^{-5})$ and $I$ is the identity
matrix.



6.1 Simulations results in strategic form games 

In this section we compare the results of our algorithm with those of fictitious
play in two coordination games. These games are depicted in Tables 2 and 3.
The game that is depicted in Table 2, as described in Section 4, is a simple
coordination game with two pure Nash equilibria, its diagonal elements. Table
3 presents an extreme version of the climbing hill game (Claus and Boutilier
1998) in which three players must climb up a utility function in order to reach
the Nash equilibrium where their reward is maximised.

We present the results of 50 replications of a learning episode of 50 iterations
for each game. As depicted in Figures 2 and 3, the proposed algorithm
performs better than fictitious play in both cases. In the simple coordination
game that is shown in Table 2, the EKF fictitious play algorithm converges to
one of the pure equilibria after a few iterations. Fictitious play, on the other
hand, is trapped in a limit cycle in all the replications where the initial joint
action was not one of the two pure Nash equilibria. For that reason the players'
payoff for all the iterations of the game was either 1 utility unit or utility
units, depending on the initial joint action. In the climbing hill game of Table 3,
the proposed algorithm converges to the Nash equilibrium after 35 iterations,
whereas fictitious play does not converge even after 50 iterations.

Figure 2: Results of EKF and classic fictitious play in the simple coordination
game of Table 2.
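The limit cycle of classic fictitious play can be illustrated with a small sketch. The payoff matrix below (1 on the diagonal, 0 elsewhere) and the irrational initial weights are assumptions chosen for illustration and are not taken from Table 2; the weights only serve to rule out ties in the best responses.

```python
import math

# Assumed 2x2 coordination game (not taken from Table 2): both players
# receive 1 at (U, L) and (D, R) and 0 otherwise.
PAYOFF = {("U", "L"): 1, ("D", "R"): 1, ("U", "R"): 0, ("D", "L"): 0}

def fictitious_play(weights_1, weights_2, iterations):
    """Classic fictitious play with simultaneous best responses.

    weights_1: Player 1's weights on Player 2's actions, keys "L" and "R".
    weights_2: Player 2's weights on Player 1's actions, keys "U" and "D".
    Returns a list of (action_1, action_2, payoff) tuples.
    """
    w1, w2 = dict(weights_1), dict(weights_2)
    history = []
    for _ in range(iterations):
        # Best response in a coordination game: match the opponent's
        # empirically more frequent action.
        a1 = "U" if w1["L"] > w1["R"] else "D"
        a2 = "L" if w2["U"] > w2["D"] else "R"
        history.append((a1, a2, PAYOFF[(a1, a2)]))
        # Each player updates the count of the action just observed.
        w1[a2] += 1
        w2[a1] += 1
    return history

# Hypothetical initial weights that start the players miscoordinated.
history = fictitious_play({"L": math.sqrt(2), "R": 1.0},
                          {"U": 1.0, "D": math.sqrt(2)}, 50)
```

Under these assumptions the joint action alternates between (U, R) and (D, L) in every iteration, so neither player ever receives the coordination payoff.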



6.2 Ad-hoc sensor network surveillance problem

We compared the results of our algorithm against those of fictitious play in a
coordination task of a power-constrained sensor network, where sensors can be
either in a sense or a sleep mode (Farinelli et al. 2008; Chapman et al. 2011).
When the sensors are in sense mode they can observe the events that occur in
their range. During their sleep mode the sensors harvest the energy they need
in order to be able to function when they are in sense mode. The sensors
should then coordinate and choose their sense/sleep schedule in order to maximise
the coverage of the events. This optimisation task can be cast as a potential
game. In particular we consider the case where $I$ sensors are deployed in an area
where $E$ events occur. If an event $e$, $e \in E$, is observed by the sensors then it
produces some utility $V_e$. Each of the sensors $i = 1, \ldots, I$ should choose an action
$s^i = j$, from one of the $j = 1, \ldots, J$ time intervals in which it can be in sense
mode. Each sensor $i$, when it is in sense mode, can observe an event $e$, if it is
in its sense range, with a probability $p_{ie}$ that is determined by $d_{ie}$, the distance between
the sensor $i$ and the event $e$. We assume that the probability each sensor has
to observe an event is independent of the other sensors. If we denote by $i_{in}$
the sensors that are in sense mode when the event $e$ occurs and $e$ is in their
sensing range, then we can write the probability that an event $e$ is observed by
the sensors $i_{in}$ as

\[ 1 - \prod_{i \in i_{in}} (1 - p_{ie}). \]

Figure 3: Probability of playing the (U,U,D) equilibrium for the EKF fictitious
play (solid line) and fictitious play (dashed line) for the three player climbing hill
game.

The expected utility that is produced by the event $e$ is the product of its
utility $V_e$ and the probability that it is observed by the sensors $i_{in}$ that are
in sense mode when the event $e$ occurs and $e$ is in their sensing range. More
formally we can express the utility that is produced by an event $e$ as:

\[ U_e(s) = V_e \Bigl( 1 - \prod_{i \in i_{in}} (1 - p_{ie}) \Bigr). \]

The global utility is then the sum of the utilities that all events, $e \in E$, produce:

\[ U_{global}(s) = \sum_{e} U_e(s). \]
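The event and global utilities described above can be sketched as follows. The function names are hypothetical, and the per-sensor observation probabilities are assumed to be supplied by the caller for exactly those sensors that are in sense mode and within range when the event occurs.

```python
def detection_probability(probs):
    """Probability that at least one sensor observes the event.

    probs: observation probabilities of the sensors that are in sense mode
    and in range; observations are assumed independent across sensors.
    """
    miss = 1.0
    for p in probs:
        miss *= 1.0 - p  # probability that every sensor so far misses
    return 1.0 - miss

def event_utility(value, probs):
    """Expected utility of one event: its value times its detection probability."""
    return value * detection_probability(probs)

def global_utility(events):
    """Sum of expected utilities over all events.

    events: iterable of (value, probs) pairs, one pair per event.
    """
    return sum(event_utility(value, probs) for value, probs in events)
```

An event with no sensor in sense mode within range contributes zero, since the empty product of miss probabilities equals 1.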

Each sensor, after each iteration of the game, receives some utility which is
based on the sensors and the events that are inside its communication and sense
range respectively. For a sensor $i$ we denote by $\tilde{e}$ the events that are in its sensing
range and by $\tilde{s}^{-i}$ the joint action of the sensors that are inside its communication
range. The utility that sensor $i$ will receive if its sense mode is $j$ will be

\[ U_i(s^i = j, \tilde{s}^{-i}) = \sum_{\tilde{e}} U_{\tilde{e}}(s^i = j, \tilde{s}^{-i}). \]

We compared the performance of the two algorithms in 2 instances of the
above scenario, one with 20 and one with 50 sensors deployed in a unit
square. In both instances sensors had to choose one time interval of the day in which
they would be in sense mode, and use the remaining time intervals to harvest energy. We
considered cases where sensors had to choose their sense mode among 2, 3 and
4 available time intervals. Sensors are able to communicate with other sensors
that are at most 0.6 distance units away, and can only observe events that are
at most 0.3 distance units away. Moreover, in both instances we assumed that
20 events took place in the unit square area. Those events were uniformly
distributed in space and time, so an event could appear at any point of
the unit square area and could occur at any time with the same probability.
The duration of each event was uniformly chosen in (0, 6] hours and each
event had a value $V_e \in (0, 1]$. Figures 4 and 5 depict the average results
of 50 replications of the game for the two algorithms. For each instance, both
algorithms ran for 50 iterations. To be able to average across the 50 replications
we normalise the utility of a replication by the global utility that the sensors
would gain if they were in sense mode during the whole day.
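A scenario of this kind could be generated along the following lines. This is only a sketch: the 24-hour horizon for event start times, the dictionary keys, and the function names are assumptions, while the ranges 0.6 and 0.3, the 20 events, the (0, 6] durations and the (0, 1] values follow the description above.

```python
import math
import random

COMM_RANGE = 0.6   # maximum distance for sensor-to-sensor communication
SENSE_RANGE = 0.3  # maximum distance at which a sensor can observe an event

def make_scenario(n_sensors, n_events=20, rng=None):
    """Place sensors and events uniformly at random in the unit square."""
    rng = rng or random.Random()
    sensors = [(rng.random(), rng.random()) for _ in range(n_sensors)]
    events = []
    for _ in range(n_events):
        events.append({
            "pos": (rng.random(), rng.random()),
            "start": 24.0 * rng.random(),            # assumed 24-hour day
            "duration": 6.0 * (1.0 - rng.random()),  # uniform in (0, 6]
            "value": 1.0 - rng.random(),             # uniform in (0, 1]
        })
    return sensors, events

def neighbours(sensors, i):
    """Indices of the sensors within communication range of sensor i."""
    return [k for k, pos in enumerate(sensors)
            if k != i and math.dist(sensors[i], pos) <= COMM_RANGE]

def observable_events(sensors, events, i):
    """Indices of the events within sensing range of sensor i."""
    return [e for e, ev in enumerate(events)
            if math.dist(sensors[i], ev["pos"]) <= SENSE_RANGE]
```

Because `random.random()` returns values in [0, 1), the expression `1.0 - rng.random()` lies in (0, 1], which matches the half-open intervals stated for durations and values.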



(a) Results when sensors have to choose between two time intervals.
(b) Results when sensors have to choose between three time intervals.
(c) Results when sensors have to choose between four time intervals.

Figure 4: Results of the instance where 20 sensors should coordinate, for both
algorithms. The results of EKF fictitious play are the solid lines and the results
of the classic fictitious play are the dashed lines. The horizontal axis of the figures
depicts the iteration of the game and the vertical axis the global utility as a
percentage of the global utility of the system in the case that sensors were
always in sense mode.

As we observe in Figures 4 and 5, EKF fictitious play converges to a stable
joint action faster than the fictitious play algorithm. In particular, on average
the EKF fictitious play algorithm needed 10 "negotiation" steps between the
sensors in order to reach a stable joint action, whereas fictitious play needed more
than 25. Moreover, the classic fictitious play algorithm always resulted in
joint actions with smaller reward than the proposed algorithm.
(a) Results when sensors have to choose between two time intervals.
(b) Results when sensors have to choose between three time intervals.
(c) Results when sensors have to choose between four time intervals.

Figure 5: Results of the instance where 50 sensors should coordinate, for both
algorithms. The results of EKF fictitious play are the solid lines and the results
of the classic fictitious play are the dashed lines. The horizontal axis of the figures
depicts the iteration of the game and the vertical axis the global utility as a
percentage of the global utility of the system in the case that sensors were
always in sense mode.

7 Conclusion 

We have introduced a variation of fictitious play that uses Extended Kalman 
filters to predict opponents' strategies. This variation of fictitious play addresses 
the implicit assumption of the classic algorithm that opponents use the same 
strategy in every iteration of the game. 

We showed that, for 2 x 2 games with at least one pure Nash equilibrium,
EKF fictitious play converges to a pure Nash equilibrium of the game. Moreover,
the proposed algorithm converges in games with a better reply path, like
potential games, with n players that have 2 available actions.

EKF fictitious play performed better than the classic algorithm
in the strategic form games and the ad-hoc sensor network surveillance problem
we simulated. Our empirical observations indicate that EKF fictitious play
converges to a solution that is better than that of the classic algorithm and needs only a
few iterations to reach that solution. Hence, by slightly increasing the computational
intensity of fictitious play, less communication is required between agents
to quickly coordinate on a desired solution.
A Proof of Proposition 1

We will base the proof of Proposition 1 on the properties of the EKF when it is
used to estimate the strategy of an opponent with two available actions. If Player i's
opponent has two available actions, 1 and 2, then we can assume that at time
$t-1$ Player i maintains beliefs about his opponent's propensities, with mean $m_{t-1}$
and variance $P_{t-1}$. Moreover, based on these estimates he chooses his strategy
$\sigma_{t-1}$. At the prediction step of this process he uses the following equations to
predict his opponent's propensities and chooses an action using best response.

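\[ m_t^- = \begin{pmatrix} m_{t-1}[1] \\ m_{t-1}[2] \end{pmatrix}, \qquad
P_t^- = \begin{pmatrix} P_{t-1}[1,1] & P_{t-1}[1,2] \\ P_{t-1}[1,2] & P_{t-1}[2,2] \end{pmatrix} + qI \]
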
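Without loss of generality we can assume that his opponent chooses
action 2 in iteration t. Then the update step will be:

\[ v_t = z_t - h(m_t^-). \]

Since Player i's opponent played action 2 and $h$ maps propensities to strategies
through the exponential (softmax) form, we can write $v_t$ and $H_t(m_t^-)$ as:

\[ v_t = \begin{pmatrix} 0 \\ 1 \end{pmatrix} - \begin{pmatrix} \sigma_{t-1}(1) \\ 1 - \sigma_{t-1}(1) \end{pmatrix}, \qquad
H_t(m_t^-) = \begin{pmatrix} a_t & -a_t \\ -a_t & a_t \end{pmatrix}, \]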



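where $a_t$ is defined as $a_t = \sigma_{t-1}(1)\sigma_{t-1}(2)$. The estimate of
$S_t = H_t(m_t^-) P_t^- H_t^T(m_t^-) + Z$ will be:

\[ S_t = \begin{pmatrix} b + r & -b \\ -b & b + r \end{pmatrix}, \]

where $b = P_t^-[1,1] + P_t^-[2,2] - 2P_t^-[1,2]$. The Kalman gain, $K_t = P_t^- H_t^T(m_t^-) S_t^{-1}$,
can be written as

\[ K_t = \frac{1}{2rb + r^2}
\begin{pmatrix} P_t^-[1,1] & k \\ k & P_t^-[2,2] \end{pmatrix}
\begin{pmatrix} a_t & -a_t \\ -a_t & a_t \end{pmatrix}
\begin{pmatrix} b + r & b \\ b & b + r \end{pmatrix}, \]

with $k = P_t^-[1,2]$. Up to a multiplicative constant we can write

\[ K_t \propto \begin{pmatrix} c & -c \\ -d & d \end{pmatrix}, \]

where $c = P_t^-[1,1] - P_t^-[1,2]$ and $d = P_t^-[2,2] - P_t^-[1,2]$. The updates for
the mean and variance are then:

\[ m_t = m_t^- + K_t v_t, \qquad P_t = P_t^- - K_t S_t K_t^T. \]

The mean of the Gaussian distribution that is used to estimate the opponent's
propensities is:

\[ m_t = \begin{pmatrix} m_t[1] \\ m_t[2] \end{pmatrix} =
\begin{pmatrix} m_t^-[1] - \dfrac{2 a_t c\, \sigma_{t-1}(1)}{2b + r} \\[2ex]
m_t^-[2] + \dfrac{2 a_t d\, \sigma_{t-1}(1)}{2b + r} \end{pmatrix}. \qquad (7) \]

Based on the above we observe that $m_t(1) < m_{t-1}(1)$ and $m_t(2) > m_{t-1}(2)$,
which completes the proof.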
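The direction of the mean update can be checked numerically with a minimal sketch of one EKF step for two actions. This is an illustration under stated assumptions, not the authors' implementation: the prediction step adds $qI$ to the previous covariance, the observation function is a softmax without a temperature parameter, $Z = rI$, and the values q = 0.01 and r = 0.1 are borrowed from Section 6. All function names are hypothetical.

```python
import math

def matmul(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

def inverse(A):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

def ekf_step(m_prev, P_prev, observed, q=0.01, r=0.1):
    """One predict/update step; observed is 0 for action 1, 1 for action 2."""
    # Prediction: random-walk propensities (assumed), covariance grows by q*I.
    m_pred = list(m_prev)
    P_pred = [[P_prev[0][0] + q, P_prev[0][1]],
              [P_prev[1][0], P_prev[1][1] + q]]
    # Softmax observation function and its Jacobian at the predicted mean.
    e1, e2 = math.exp(m_pred[0]), math.exp(m_pred[1])
    sigma = [e1 / (e1 + e2), e2 / (e1 + e2)]
    a = sigma[0] * sigma[1]
    H = [[a, -a], [-a, a]]
    # Innovation: indicator of the observed action minus predicted strategy.
    z = [1.0 if j == observed else 0.0 for j in range(2)]
    v = [z[0] - sigma[0], z[1] - sigma[1]]
    # Innovation covariance S = H P H^T + r*I and Kalman gain K = P H^T S^-1.
    HPHt = matmul(matmul(H, P_pred), transpose(H))
    S = [[HPHt[0][0] + r, HPHt[0][1]], [HPHt[1][0], HPHt[1][1] + r]]
    K = matmul(matmul(P_pred, transpose(H)), inverse(S))
    # Update of mean and covariance.
    m_new = [m_pred[i] + K[i][0] * v[0] + K[i][1] * v[1] for i in range(2)]
    KSKt = matmul(matmul(K, S), transpose(K))
    P_new = [[P_pred[i][j] - KSKt[i][j] for j in range(2)] for i in range(2)]
    return m_new, P_new

# The opponent plays action 2: the mean of propensity 1 should fall
# and the mean of propensity 2 should rise.
m_new, P_new = ekf_step([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], observed=1)
```

With a diagonal prior covariance the two propensity means move by equal amounts in opposite directions, and the posterior variance of each propensity is smaller than its predicted variance.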

B Proof of Proposition 2 

We consider 2 x 2 games with at least one pure Nash equilibrium. In the case
that only one Nash equilibrium exists, a dominant strategy exists and thus one
of the players will not deviate from this action. Hence we are interested
in 2 x 2 games with two pure Nash equilibria. Without loss of generality we
consider a game with similar structure to the simple coordination game that is
depicted in Table 2, with two equilibria, the joint actions on the diagonal of the
payoff matrix, (U,L) and (D,R). We will present calculations for Player 1, but
the same results hold also for Player 2. We define $\lambda$ as the necessary confidence
level that Player 1's estimate of $\sigma_t(L)$ should reach in order to choose action
U. Hence Player 1 will choose U if:

\[ \begin{aligned}
\sigma_t(1) > \lambda
&\Leftrightarrow \frac{\exp(m_t[1])}{\exp(m_t[1]) + \exp(m_t[2])} > \lambda \\
&\Leftrightarrow m_t[1] > \ln\Bigl(\frac{\lambda}{1-\lambda}\Bigr) + m_t[2] \\
&\Leftrightarrow m_{t-1}[1] > \ln\Bigl(\frac{\lambda}{1-\lambda}\Bigr) + m_{t-1}[2]
\end{aligned} \]
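The equivalence between the condition on the estimated strategy and the condition on the propensity means can be verified with a short sketch. The function names are hypothetical, and the symbol `lam` stands for the confidence level used in the derivation above.

```python
import math

def softmax_first(m1, m2):
    """Estimated probability of action 1 from the two propensity means."""
    e1, e2 = math.exp(m1), math.exp(m2)
    return e1 / (e1 + e2)

def exceeds_threshold(m1, m2, lam):
    """Same condition expressed directly on the propensity means."""
    return m1 > math.log(lam / (1.0 - lam)) + m2

def conditions_agree(m1, m2, lam):
    """True when both formulations give the same decision."""
    return (softmax_first(m1, m2) > lam) == exceeds_threshold(m1, m2, lam)
```

For example, with `lam = 0.6` the threshold on the difference `m1 - m2` is `ln(1.5)`, roughly 0.405, so `m1 = 1.3, m2 = 0` satisfies both conditions while `m1 = 0, m2 = 0` satisfies neither.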

In order to prove Proposition 2 we need to show that when a player changes
his action, his opponent will change his action at the same iteration with
probability less than 1. In the case where at time $t-1$ the joint action of the players
is (U, R), Player 1 believes that his opponent will play L, while observing
him playing R. Assume that Player 2's beliefs about Player 1's strategy have
reached the necessary confidence level and that at iteration
t he will change his action from R to L. Player 1 will also change his action at
the same time if

\[ m_{t-1}[2] > \ln\Bigl(\frac{1-\lambda}{\lambda}\Bigr) + m_{t-1}[1]. \]

We want to show that the players will not change actions simultaneously with
probability 1. Hence it is enough to show that
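\[ \mathrm{Prob}\Bigl( m_{t-1}[1] > \ln\Bigl(\frac{\lambda}{1-\lambda}\Bigr) + m_{t-1}[2] \Bigr) > 0. \qquad (8) \]

We can replace $m_{t-1}[1]$ and $m_{t-1}[2]$ with their equivalents from (7) and write:

\[ \begin{aligned}
\frac{-4 a_t (b-k)\,\sigma(1)}{4a_t^2(b-k) + (r+\epsilon)}
&> \ln\Bigl(\frac{\lambda}{1-\lambda}\Bigr) + m_t^-[2] - m_t^-[1] \Leftrightarrow \\
\frac{a_t (b-k)}{4a_t^2(b-k) + (r+\epsilon)}
&< \frac{\ln\bigl(\frac{\lambda}{1-\lambda}\bigr) + m_t^-[2] - m_t^-[1]}{-4\sigma(1)}
\end{aligned} \]

Solving this with respect to $\epsilon$ we have

\[ \epsilon > \frac{-4(b-k)\,a_t\,\sigma(1)}{\ln\bigl(\frac{\lambda}{1-\lambda}\bigr) + m_t^-[2] - m_t^-[1]} - 4a_t^2(b-k) - r. \]

Thus we can write (8) as:

\[ \mathrm{Prob}\Bigl( \epsilon > \frac{-4(b-k)\,a_t\,\sigma(1)}{\ln\bigl(\frac{\lambda}{1-\lambda}\bigr) + m_t^-[2] - m_t^-[1]} - 4a_t^2(b-k) - r \Bigr) > 0. \qquad (9) \]

Since $\epsilon$ is a Gaussian white noise, (9) is always true.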

We also consider the case where at time $t-1$ the joint action of the players
is (D, L). Then Player 1 believes that his opponent will play R, while observing
him playing L. Assume that Player 2's beliefs about Player 1's strategy have
reached the necessary confidence level and that at iteration t he will change his
action from L to R. Player 1 will also change his action at the same time if

\[ m_{t-1}[1] > \ln\Bigl(\frac{\lambda}{1-\lambda}\Bigr) + m_{t-1}[2]. \]

We want to show that the players will not change actions simultaneously with
probability 1. Hence it is enough to show that

\[ \mathrm{Prob}\Bigl( m_{t-1}[2] > \ln\Bigl(\frac{1-\lambda}{\lambda}\Bigr) + m_{t-1}[1] \Bigr) > 0. \qquad (10) \]



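We can rewrite (10), using the results we obtained for $m_{t-1}[1]$ and $m_{t-1}[2]$ in
(7), again as

\[ \mathrm{Prob}\Bigl( \epsilon > \frac{-4(b-k)\,a_t\,\sigma(2)}{\ln\bigl(\frac{1-\lambda}{\lambda}\bigr) + m_t^-[1] - m_t^-[2]} - 4a_t^2(b-k) - r \Bigr) > 0. \qquad (11) \]

Since $\epsilon$ is a Gaussian white noise, (11) is always true.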

If we define $\xi_t$ as the event that both players change their action simultaneously
at time t, and assume that the two players have changed their actions simultaneously
at iterations $t_1, t_2, \ldots, t_T$, then the probability that they will
also change their action simultaneously at time $t_{T+1}$,
$P(\xi_{t_1}, \xi_{t_2}, \ldots, \xi_{t_T}, \xi_{t_{T+1}})$,
is almost zero for large but finite T.